Abstract

Study of the cluster- or community structure of complex networks makes an impor- 
tant contribution to the understanding of networks at a functional level. Despite the 
many efforts, no definition of community has been agreed on and important aspects 
such as the statistical significance and theoretical limits of community detection 
are not well understood. We show how the problem of community detection can be 
mapped onto finding the ground state of an infinite range spin glass. The ground 
state energy then corresponds directly to the quality of the partition. The network 
modularity Q previously defined by Girvan and Newman [1] turns out to be a spe- 
cial case of this spin glass energy. Through this spin glass analogy, we are able to 
give expectation values for the modularity of random graphs that can be used in 
the assessment of the statistical significance of real network clusterings. Further, it 
allows for assessing the theoretical limits of community detection. 

1 Introduction



With the increasing availability and steadily increasing size of relational
datasets or networks, the need for appropriate methods for exploratory data
analysis arises. For general statistical properties such as the degree
distribution, degree correlations, clustering etc., a number of well
established methods and models to explain their origin exist [2,3]. However, a
standard analysis for the higher order structure in graphs has not been
established so far. Currently, the problem of the cluster or community
structure is the subject of intense study [4,5]. Cluster analysis is an
important technique that allows for data abstraction and dimensionality
reduction and aids in data visualization. It is used in fields ranging from
the life sciences [6] and bibliometrics [7] to market research [8], and has
implications for experiment planning, funding policies or marketing.

However, some important aspects of the problem are still not clearly under- 
stood. First, there is little agreement with regard to the definition of a com- 
munity or cluster in a network and second, and more importantly, no adequate 
measure of the statistical significance of network clustering exists. This article 
aims at contributing to both of these questions. 



2 What is a community? 

Despite the many applications of community detection across the sciences, it
remains remarkably unclear what a community actually is. In addition to the
many definitions that are given in sociology [9], the physics community has
contributed a fair number as well [4,5]. All authors agree that a community
should be a group of nodes that is more densely connected among each other
than with the rest of the network, but differ largely in the details. Below, we
give a short overview of the different aspects that have been emphasized by
different authors.

The initial work on communities by Girvan and Newman [10] gives an algorithmic
definition. They design a community detection algorithm which recursively
partitions the graph to produce a hierarchy of communities from the
entire network down to single nodes. At each point, the nodes belonging to
distinct sub-trees in the resulting dendrogram are considered as communities.

Radicchi et al. [11] tried to improve this heuristic definition by coining the
term "community in a strong sense", such that

$$k_i^{in} > k_i^{out}, \quad \forall i \in C. \qquad (1)$$

This means that for all nodes $i$ in the community $C$, the number of
connections $k_i^{in}$ that node $i$ has to members of its own community is
larger than $k_i^{out}$, the number of connections it has to the rest of the
network. Further, they define a "community in a weak sense", such that the sum
of internal connections is larger than the sum of external links,
$\sum_{i \in C} k_i^{in} > \sum_{i \in C} k_i^{out}$. Radicchi et al. then
suggest stopping any recursive partitioning algorithm when an additional
partition would not result in a community in the strong (or weak) sense.
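
The two criteria above can be checked mechanically. The following is a minimal
sketch, assuming the graph is stored as a dictionary of neighbour sets; the
names `in_out_degrees`, `is_strong_community`, `is_weak_community` and
`example_graph` are hypothetical and not taken from [11].

```python
def in_out_degrees(graph, community):
    """Return {node: (k_in, k_out)} for every node of the community.

    graph: dict mapping node -> set of neighbours (assumed representation).
    community: set of nodes.
    """
    return {
        i: (len(graph[i] & community), len(graph[i] - community))
        for i in community
    }


def is_strong_community(graph, community):
    """Eq. (1): every node has more internal than external links."""
    degrees = in_out_degrees(graph, community)
    return all(k_in > k_out for k_in, k_out in degrees.values())


def is_weak_community(graph, community):
    """Weak sense: summed internal degree exceeds summed external degree."""
    degrees = list(in_out_degrees(graph, community).values())
    total_in = sum(k_in for k_in, _ in degrees)
    total_out = sum(k_out for _, k_out in degrees)
    return total_in > total_out


# A triangle {0, 1, 2} whose node 2 has three additional outside neighbours.
example_graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4, 5},
    3: {2}, 4: {2}, 5: {2},
}
```

In this example the triangle is a community in the weak sense only, because
node 2 has two internal but three external links.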

Palla et al. [7,12] have given a definition based on reachability. They define a
sub-graph percolation process based on k-cliques (fully connected subgraphs
with k nodes). Two k-cliques are connected if they share a (k-1)-clique, e.g.
two triangles (which are 3-cliques) are connected if they share an edge (a
2-clique). A community, or k-clique percolation cluster, is then defined as the
group of nodes that can be reached via adjacent k-cliques. Communities may
overlap, i.e. nodes may belong to more than one percolation cluster, but
communities corresponding to (k+1)-clique percolation clusters always lie
completely within k-clique clusters.
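
The percolation definition can be illustrated on a toy graph. The sketch below
is a brute-force illustration only, not the algorithm of [7,12]: it enumerates
all k-cliques explicitly, which is feasible only for very small graphs. The
function name `k_clique_communities` and the graph `example_graph` are
hypothetical.

```python
from itertools import combinations


def k_clique_communities(graph, k):
    """Return a list of node sets, one per k-clique percolation cluster.

    graph: dict mapping node -> set of neighbours (assumed representation).
    """
    nodes = sorted(graph)
    # Enumerate all k-cliques by brute force.
    cliques = [
        frozenset(c)
        for c in combinations(nodes, k)
        if all(v in graph[u] for u, v in combinations(c, 2))
    ]
    # Union-find over cliques: two cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(range(len(cliques)), 2):
        if len(cliques[a] & cliques[b]) == k - 1:
            parent[find(a)] = find(b)

    # Merge the node sets of all cliques that ended up in the same cluster.
    clusters = {}
    for idx, clique in enumerate(cliques):
        clusters.setdefault(find(idx), set()).update(clique)
    return list(clusters.values())


# Triangles (0,1,2) and (1,2,3) share an edge; triangle (3,4,5) shares only node 3.
example_graph = {
    0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {1, 2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
```

For k = 3 the example yields two clusters that overlap in node 3, which
illustrates how nodes may belong to more than one community.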

Girvan and Newman have further defined a quantitative measure of the qual- 
ity of an assignment of nodes into communities. This so-called "modularity" 
[1] can be used to compare different assignments of nodes into communities 
quantitatively. The modularity is defined as:

$$Q = \sum_s \left( e_{ss} - a_s^2 \right). \qquad (2)$$

The sum runs over all communities $s$. The fraction of all links connecting
nodes in groups $s$ and $r$ is denoted by $e_{rs}$. Hence, $e_{ss}$ is the
fraction of all links lying within group $s$. The fraction of all links
connecting to nodes in group $s$ is denoted by $a_s = \sum_r e_{rs}$. One can
interpret $a_s^2$ as the expected fraction of internal links in group $s$, if
the network was random and the nodes were distributed randomly into the
different groups. Such a measure can be used to stop recursive partitioning or
agglomerative approaches when they no longer lead to an improvement of $Q$
[13].
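
As an illustration of the modularity defined above, the following sketch
evaluates $Q = \sum_s (e_{ss} - a_s^2)$ for an undirected edge list. It assumes
that $a_s$ is computed as the fraction of link ends attached to group $s$; the
function name `modularity` and its arguments are hypothetical.

```python
from collections import defaultdict


def modularity(edges, group):
    """Q = sum_s (e_ss - a_s^2) for an undirected edge list.

    edges: iterable of (i, j) pairs, each undirected link listed once.
    group: dict mapping node -> community label.
    """
    edges = list(edges)
    m = len(edges)
    internal = defaultdict(int)    # number of links inside each group
    link_ends = defaultdict(int)   # number of link ends attached to each group
    for i, j in edges:
        link_ends[group[i]] += 1
        link_ends[group[j]] += 1
        if group[i] == group[j]:
            internal[group[i]] += 1
    return sum(
        internal[s] / m - (link_ends[s] / (2 * m)) ** 2 for s in link_ends
    )
```

For two triangles joined by a single link, with each triangle taken as one
group, every group holds 3 of the 7 links and half of all link ends, so that
Q = 2 (3/7 - 1/4) = 5/14.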

We see the diversity of definitions and approaches of which we have described 
only a few. References [4,5] give a more comprehensive overview. Because of 
this controversy of opinions, we will set out from a first principles approach 
in the next section that will shed some light on the general properties of the 
problem. 



3 A first principles approach to community detection 

Instead of defining what a community is and then trying to devise an algorithm 
in order to detect it, we use a different approach. We start from a simple 
principle: to group nodes that are not linked in different communities and to 
put nodes which are linked in the same community. With this principle, we 
write the following Hamiltonian: 



$$\mathcal{H}(\{\sigma\}) = -\sum_{i<j} a_{ij} A_{ij}\, \delta(\sigma_i, \sigma_j) + \sum_{i<j} b_{ij} (1 - A_{ij})\, \delta(\sigma_i, \sigma_j). \qquad (3)$$

Here, $\sigma_i$ denotes the group index of node $i$, $\delta(\sigma_i,
\sigma_j)$ is the Kronecker delta, and $A_{ij}$ is the adjacency matrix of the
network with $A_{ij} = 1$ if nodes $i$ and $j$ are connected and zero
otherwise. Hence, the first sum runs over all pairs of connected nodes, while
the second sum runs over all pairs of unconnected nodes. Our Hamiltonian
rewards every pair of connected nodes $(i,j)$ in the same group with $a_{ij}$
and penalizes every pair of unconnected nodes $(i,j)$ in the same community with
$b_{ij}$. It implements just the principle we started out from. Any spin
configuration that minimizes (3) is hence optimal in the sense of this first
principle. It is now important to define the weights $a_{ij}$ and $b_{ij}$ in a
sensible way. A particularly good choice is to balance them, such that all
existing connections in the network are equally important to our optimality
criterion as all missing connections, of which there are generally many more:

$$\sum_{i<j} a_{ij} A_{ij} = \sum_{i<j} b_{ij} (1 - A_{ij}). \qquad (4)$$



One way of satisfying this equation is to set $a_{ij} = 1 - \gamma p_{ij}$ and
$b_{ij} = \gamma p_{ij}$, which also reduces the need for two different weights
to only one. We have introduced an additional parameter $\gamma$ that will
allow us to adjust the balance of missing and existing links. The only
constraint we have to impose on $p_{ij}$ in order to fulfill (4) is that
$\sum_{i<j} p_{ij} = M$, with $M$ being the total number of links in the
network. With this choice, we can now rewrite (3) in a much simpler form:

$$\mathcal{H}(\{\sigma\}) = -\sum_{i<j} \left( A_{ij} - \gamma p_{ij} \right) \delta(\sigma_i, \sigma_j). \qquad (5)$$



Equation (5) is formally identical to the Hamiltonian for a q-state Potts spin
glass, with $q$ being the number of possible group indices. The coupling matrix
is then defined as $J_{ij} = A_{ij} - \gamma p_{ij}$. Though $p_{ij}$ can take
any form, it is sensible to identify it with the connection probability between
nodes $i$ and $j$ in the network. Depending on the network under study, this
can be

$$p_{ij} = p, \qquad (6)$$

if the links are assumed to connect nodes with constant probability
$p = 2M/(N(N-1))$. Another possible choice is

$$p_{ij} = \frac{k_i k_j}{2M}, \qquad (7)$$

if the degree distribution of the nodes is to be taken into account and there
are no degree-degree correlations. Here $k_i$ denotes the degree of node $i$ and
$M$ represents the number of links in the network as before.

Both of these choices render $p_{ij}$ positive and smaller than one, hence we
are dealing with a spin glass which has ferromagnetic couplings between
connected nodes and anti-ferromagnetic couplings between unconnected nodes.
The ground state of this spin glass defines the optimal assignment of nodes
into communities. Note that for $\gamma = 1$ and $p_{ij} = k_i k_j / 2M$, we
recover the modularity $Q$ defined by Girvan and Newman [1] from (5) via
$Q = -\frac{1}{M} \mathcal{H}$ [14].
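
To make the Hamiltonian (5) concrete, the sketch below evaluates it by brute
force over all node pairs for either of the two connection models (6) and (7).
It is meant as an illustration on small graphs only; the function name
`potts_energy` and the `null_model` switch are hypothetical.

```python
from itertools import combinations


def potts_energy(n_nodes, edges, sigma, gamma=1.0, null_model="degree"):
    """Eq. (5): H = -sum_{i<j} (A_ij - gamma * p_ij) * delta(sigma_i, sigma_j).

    n_nodes: nodes are assumed to be labelled 0 .. n_nodes - 1.
    edges: iterable of undirected (i, j) pairs.
    sigma: sequence of group indices, one per node.
    null_model: "degree" uses eq. (7), anything else uses eq. (6).
    """
    edge_set = {frozenset(e) for e in edges}
    m = len(edge_set)
    degree = [0] * n_nodes
    for e in edge_set:
        for v in e:
            degree[v] += 1
    p_const = 2 * m / (n_nodes * (n_nodes - 1))  # eq. (6)
    energy = 0.0
    for i, j in combinations(range(n_nodes), 2):
        if sigma[i] != sigma[j]:
            continue  # the Kronecker delta vanishes
        a_ij = 1.0 if frozenset((i, j)) in edge_set else 0.0
        if null_model == "degree":
            p_ij = degree[i] * degree[j] / (2 * m)  # eq. (7)
        else:
            p_ij = p_const
        energy -= a_ij - gamma * p_ij
    return energy
```

Because the sum in (5) runs over pairs $i<j$ only, $-\mathcal{H}/M$ evaluated
this way differs from the modularity of a small graph by the self-pair term
$\sum_i k_i^2 / (4M^2)$, which becomes negligible for large networks.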

It is worth rewriting (5) as a sum over spin states s: 

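$$\mathcal{H} = -\sum_s \underbrace{\left( m_{ss} - \gamma [m_{ss}]_{p_{ij}} \right)}_{c_{ss}} = \sum_{s<r} \underbrace{\left( m_{rs} - \gamma [m_{rs}]_{p_{ij}} \right)}_{a_{rs}}. \qquad (8)$$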



We denote the number of links within group $s$ by $m_{ss}$ and between groups
$r$ and $s$ by $m_{rs}$. Further, we denote the expectation values of these
quantities under the model of connection probability $p_{ij}$ and assuming a
random assignment of spins by $[\cdot]_{p_{ij}}$. In (8) we have introduced two
new terms $c_{ss}$ and $a_{rs}$ which measure within-group "cohesion" and
between-group "adhesion", respectively. Note that maximizing cohesion and
minimizing adhesion are in fact equivalent and will hence always be extremal at
the same time, i.e. any configuration of spins that minimizes $\mathcal{H}$ will
automatically maximize cohesion and minimize adhesion.

In particular, we have for the two connection models introduced above 

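$$[m_{ss}]_{p_{ij}} = p\, \frac{n_s (n_s - 1)}{2}, \quad \text{and} \quad [m_{rs}]_{p_{ij}} = p\, n_r n_s \qquad (9)$$

for $p_{ij} = p$. The number of nodes in group $s$ is denoted by the occupation
number $n_s$ [15]. For $p_{ij} = k_i k_j / 2M$ we find

$$[m_{ss}]_{p_{ij}} = \frac{1}{4M} K_s^2, \quad \text{and} \quad [m_{rs}]_{p_{ij}} = \frac{1}{2M} K_s K_r, \qquad (10)$$

where $K_s$ denotes the sum of degrees of nodes in group $s$ in a similar way as
the occupation number $n_s$.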
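
The coefficients of cohesion and adhesion in (8) can be computed directly from
link counts and degree sums. The sketch below uses the expectation values (10)
for $p_{ij} = k_i k_j / 2M$; the function name `cohesion_adhesion` is
hypothetical, and groups without any link end are simply omitted.

```python
from collections import Counter
from itertools import combinations


def cohesion_adhesion(edges, group, gamma=1.0):
    """Return (c, a): cohesion c[s] and adhesion a[(r, s)] based on eq. (10).

    edges: iterable of undirected (i, j) pairs, each link listed once.
    group: dict mapping node -> sortable community label.
    """
    edges = list(edges)
    m_total = len(edges)
    links = Counter()       # m_ss under key (s, s), m_rs under key (r, s), r < s
    degree_sum = Counter()  # K_s, the sum of degrees in group s
    for i, j in edges:
        r, s = sorted((group[i], group[j]))
        links[(r, s)] += 1
        degree_sum[group[i]] += 1
        degree_sum[group[j]] += 1
    labels = sorted(degree_sum)
    c = {
        s: links[(s, s)] - gamma * degree_sum[s] ** 2 / (4 * m_total)
        for s in labels
    }
    a = {
        (r, s): links[(r, s)]
        - gamma * degree_sum[r] * degree_sum[s] / (2 * m_total)
        for r, s in combinations(labels, 2)
    }
    return c, a
```

For $\gamma = 1$ the cohesion terms and the adhesion terms sum to zero, which is
the equality of the two sums in (8). For two triangles joined by one link this
gives $c_{ss} = 1.25$ for each triangle and $a_{rs} = -2.5$ between them.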

We have thus shown that finding the community structure of a network is
equivalent to finding the ground state of an infinite range spin glass. Note
that non-zero couplings exist between all pairs of nodes. Fortunately, the
particular choice of $p_{ij}$ allows us to implement efficient optimization
routines [16] that only need to consider interactions along the links and treat
the anti-ferromagnetic interactions along the non-existing links in a mean
field manner, which is, however, not an approximation but accounts exactly for
the repulsive interactions. One only needs to keep track of the occupation
numbers of the spin states or the total sum of degrees in each group.
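
The bookkeeping described above can be sketched as follows for
$p_{ij} = k_i k_j / 2M$: the energy change caused by moving one node to another
group needs only the node's own links and the degree sums $K_s$. This is an
illustration under stated assumptions, not the optimization routine of [16];
the names `move_delta` and `energy` are hypothetical, and `energy` is a
brute-force reference used only to check the result.

```python
from itertools import combinations


def energy(adj, degree, sigma, two_m, gamma=1.0):
    """Brute-force eq. (5) over all pairs, used here only as a reference."""
    total = 0.0
    for i, j in combinations(sorted(adj), 2):
        if sigma[i] == sigma[j]:
            total -= (j in adj[i]) - gamma * degree[i] * degree[j] / two_m
    return total


def move_delta(adj, degree, sigma, degree_sum, two_m, v, target, gamma=1.0):
    """Change of eq. (5) when node v moves from its group to `target`.

    adj: dict node -> set of neighbours; degree: dict node -> degree.
    sigma: dict node -> group; degree_sum: dict group -> K_s (before the move).
    Only the neighbours of v and two degree sums are inspected.
    """
    source = sigma[v]
    if source == target:
        return 0.0
    links_to = {source: 0, target: 0}
    for u in adj[v]:
        if sigma[u] in links_to:
            links_to[sigma[u]] += 1
    k_v = degree[v]
    # Contribution of v in its current group (v itself is excluded from K_s).
    old = -(links_to[source] - gamma * k_v * (degree_sum[source] - k_v) / two_m)
    # Contribution of v after joining the target group.
    new = -(links_to[target] - gamma * k_v * degree_sum.get(target, 0) / two_m)
    return new - old
```

After accepting a move, `sigma[v]` and the two affected entries of `degree_sum`
have to be updated by the caller.
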

We are now able to give a definition of community which follows directly from
the properties of the ground state as a global minimum of (5). We define
as the community structure of a graph an assignment of nodes into groups
(spin states) that makes (5) minimal. Such an assignment then possesses the
following properties:

(1) Every proper subset $n_1$ of a community $n_s$ has a maximum coefficient of
adhesion with its complement in the community compared to the coefficient
of adhesion with any other community ($a_{1,s \setminus 1} = \max$).

(2) The coefficient of cohesion is non-negative for all communities
($c_{ss} \geq 0$).

(3) The coefficient of adhesion between any two communities is non-positive
($a_{rs} \leq 0$).

This also defines the term "community". A community is a group of nodes 
that has the above three properties. 

If the ground state is degenerate, i.e. different assignments of nodes into com- 
munities lead to the same ground state energy, this allows us to define over- 
lapping community structure in a natural way. Degeneracy may occur in two 
different forms. On one hand it may be possible to move part of a commu- 
nity a to another community b without increasing the energy. We say the two 
communities a and b overlap, since the total number of communities stays 
constant. On the other hand, it may be that one may split a community a 
into two or more communities or join it with another community b without 
increasing the energy. Since the number of communities changes, we speak of 
overlapping community structures. Naturally, all groups of nodes with equal 
spin value in any configuration that represents a local minimum of (5) will also 
qualify as communities. They can be regarded as sub-optimal assignments and 
the study of their overlap among each other and with the ground state yields 
valuable information about how many alternative, but sensible groupings exist 
for a particular network [15,16].

The ground state depends on the value of $\gamma$ chosen. The value of $\gamma$
at which the community structure was obtained should always be quoted. Changing
the value of $\gamma$ allows one to detect hierarchies in the assignment of
nodes into communities [15,16].

We benchmarked the performance of this approach to community detection
on computer generated test networks [15] and compared the results to those
obtained by Girvan and Newman's betweenness algorithm [10]. The networks
are Erdős-Rényi (ER) graphs [17] with an average degree of
$\langle k \rangle = 16$ and 128 nodes. The nodes were divided into 4 groups of
32 nodes each. Keeping the average degree fixed, the links per node were
distributed into an average of $\langle k_{in} \rangle$ to members of the same
group and an average of $\langle k_{out} \rangle$ to members of the 3 remaining
groups in the network such that
$\langle k_{out} \rangle + \langle k_{in} \rangle = \langle k \rangle$.

Fig. 1. Benchmarks of a community detection algorithm based on finding the
ground state of a spin glass and comparison with Girvan and Newman's algorithm
[10]. Tests were run on computer generated test networks with known community
structure. "Sensitivity" denotes the fraction of all pairs of nodes that are
classified correctly in the same community, while "specificity" denotes the
fraction of all pairs of nodes classified correctly in different communities.

Obviously, increasing $\langle k_{out} \rangle$ at the expense of
$\langle k_{in} \rangle$ makes the recovery of the designed community structure
more difficult. At $\langle k_{in} \rangle = 4$ the network should be completely
random and any trace of the built-in community structure is lost, since at this
point the probability to link to a member of a different group equals the
probability to link to a member of the same group, $p_{in} = p_{out} = p$.
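
A test network of the kind described above could be generated as in the
following sketch. It assumes that $\langle k_{in} \rangle$ is spread evenly over
the 31 other members of a node's group and $\langle k_{out} \rangle$ over the 96
nodes of the other groups; the function name `planted_partition` and its
parameters are hypothetical and not taken from [15].

```python
import random
from itertools import combinations


def planted_partition(k_in, k_avg=16.0, n_groups=4, group_size=32, seed=None):
    """Sample a test network; returns (edges, group) with group[i] the label."""
    rng = random.Random(seed)
    n = n_groups * group_size
    k_out = k_avg - k_in
    p_in = k_in / (group_size - 1)     # assumed link probability inside a group
    p_out = k_out / (n - group_size)   # assumed link probability across groups
    group = [i // group_size for i in range(n)]
    edges = [
        (i, j)
        for i, j in combinations(range(n), 2)
        if rng.random() < (p_in if group[i] == group[j] else p_out)
    ]
    return edges, group
```

With these assumptions, `k_in=4` gives `p_in = 4/31` and `p_out = 12/96`, which
are nearly equal, in line with the statement that the built-in structure is
lost at this point.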

Figure 1 shows the results of the benchmarks. We measured the success of
the two methods in terms of sensitivity and specificity. Sensitivity measures
the fraction of pairs of nodes that are correctly classified as being in the
same cluster, while specificity measures the fraction of pairs of nodes
correctly classified as belonging to different clusters. In other words, the
two measures indicate how good the algorithms are at grouping together what
belongs together and at keeping apart what does not belong together. From
Figure 1 we see that both algorithms are rather conservative in terms of
grouping things together, as indicated by the high levels of specificity. The
change in sensitivity is much more drastic and we find that the Potts model
approach outperforms the algorithm of Girvan and Newman [10]. The critical
value $\langle k_{in} \rangle_c$ below which the built-in community structure
can no longer be recovered seems to be $\langle k_{in} \rangle_c = 8$.
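
The two pairwise measures can be written down in a few lines. The sketch below
follows the definitions given in the text; the function name
`pair_sensitivity_specificity` is hypothetical.

```python
from itertools import combinations


def pair_sensitivity_specificity(true_group, found_group):
    """Pairwise sensitivity and specificity of a clustering.

    Both arguments are sequences of labels indexed by node.
    """
    same_true = same_both = diff_true = diff_both = 0
    for i, j in combinations(range(len(true_group)), 2):
        together_true = true_group[i] == true_group[j]
        together_found = found_group[i] == found_group[j]
        if together_true:
            same_true += 1
            same_both += together_found       # pair correctly kept together
        else:
            diff_true += 1
            diff_both += not together_found   # pair correctly kept apart
    sensitivity = same_both / same_true if same_true else float("nan")
    specificity = diff_both / diff_true if diff_true else float("nan")
    return sensitivity, specificity
```

If one node of a three-node group is wrongly assigned to the other three-node
group, 4 of the 6 same-group pairs and 6 of the 9 different-group pairs are
still classified correctly, so both measures equal 2/3.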



4 Communities and Modularities in Random Networks 



In our introductory paragraphs, we have already raised the question when 
one may call a network truly modular. Obviously, running a clustering al- 
gorithm over a set of randomly generated data points will always produce 
clusters which, however, have little meaning. Similarly, minimizing the
modularity Hamiltonian on a random graph results in a community structure which
has all the desired properties. This does, of course, not mean that the graph 
we studied was in fact modular. A differentiation between graphs which are 
truly modular and those which are not can hence only be made if we gain 
an understanding of the intrinsic modularity of random graphs. By comparing 
the modularity of random graphs with that of real world graphs, we can assess 
whether the graphs under study are truly modular. 

Such a comparison can of course always be made by randomizing the network 
under study keeping the degree distribution invariant. Such algorithms then 
remove all correlations and community structures possibly present in the data. 
Comparing the results of clustering the empirical data and a randomized ver- 
sion of it can always give a clue to what extent the data shows modularity 
above that of a random network with the same degree distribution. Neverthe- 
less, such analysis is biased by the algorithm used to detect the community 
structure. Much more desirable would be a measure of modularity that can 
be used to compare with any algorithm. 

In mapping the problem of finding a community structure onto finding the 
ground state of an infinite range spin glass, we have defined a coupling matrix 
$J_{ij}$ with the following distribution of couplings:

$$p(J_{ij}) = p_{ij}\, \delta\big(J_{ij} - (1 - p_{ij})\big) + (1 - p_{ij})\, \delta\big(J_{ij} + p_{ij}\big), \qquad (11)$$

where we have set $\gamma = 1$ and assumed we are dealing with a random network
in which the links are distributed with the same $p_{ij}$ we use for defining
the weights $a_{ij}$ and $b_{ij}$ of the contributions of existing and missing
links in the clustering. It is easy to see that this distribution has zero
mean, since $p_{ij}(1 - p_{ij}) - (1 - p_{ij})\, p_{ij} = 0$. Since the mean of
the distribution of couplings couples only to the magnetization, we find a zero
magnetization in the ground state [18]. This corresponds to an equi-partition
of the network. The community structure of a random network consists of
communities of equal size. A symmetry argument can also be invoked to
understand this. In an uncorrelated random graph, there is no reason for
a particular size of communities and hence, they must be of equal size. If
we conceive of community detection as looking for the "natural partition" of a
network, then the natural partition of a random graph is the equi-partition.

For the number of edges to cut when equi-partitioning a random graph, a
number of results have existed since the 1980s, beginning with the paper by Fu
and Anderson [18] about bi-partitioning a random graph. Kanter and Sompolinsky
[19] have given an expression for the minimum total number of inter-community
edges $C$, also called cut-size, when partitioning a random graph into $q$
equal-sized parts:

$$\min_{\{\sigma_i\}} \big[ C(\{N/q\}) \big] = \frac{N^2 p\,(q-1)}{2q} - N^{3/2} J\, \frac{U(q)}{q}. \qquad (12)$$

The minimum is taken over all possible spin configurations $\{\sigma\}$ with
equal occupation numbers $n_s = N/q$. The first term in the expression is the
expectation value of $C$ for a random assignment of spin states and the second
term is a correction due to optimization of the configuration, which depends
on the standard deviation $J$ of the coupling matrix and a constant depending
on the number of parts $q$. In the case of an ER graph, the standard deviation
of the coupling matrix is given by $J = \sqrt{p(1-p)}$, with $p$ denoting the
average connection probability in the network given by $p = 2M/(N(N-1))$.
From this, we can immediately write an expectation value for the modularity
of random graphs:

$$Q = \frac{1}{M} N^{3/2} J\, \frac{U(q)}{q}. \qquad (13)$$

For $U(q)$, the ground state energy of a q-state Potts model with Gaussian
couplings of zero mean and variance $J^2$, some values for small $q$ are given
in Table 1, obtained by using the exact formula for calculating $U(q)$ from
[19]. For large $q$, we can approximate $U(q) \approx \sqrt{q \ln q}$ [19].

Table 1
Values of U(q)/q for various values of q obtained from [19], which can be used to
approximate the expected modularity with equation (13).

q        2      3      4      5      6      7      8      9      10
U(q)/q   0.384  0.464  0.484  0.485  0.479  0.471  0.461  0.452  0.442

We see that maximum modularity is obtained at $q = 5$, though the value of
$U(q)/q$ for $q = 4$ is not much different from it. This qualitative behavior of
dense random graphs tending to cluster into only a few large communities is
confirmed by our numerical experiments. Using the largest value of Table 1,
we finally arrive at an expression for the modularity that we can expect in
any ER random graph with average degree $\langle k \rangle = pN$:

$$Q = 0.97 \sqrt{\frac{1-p}{pN}}. \qquad (14)$$
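
The expectation value can be evaluated numerically as in the sketch below. It
assumes the large-$N$ approximation $M \approx pN^2/2$ together with
$J = \sqrt{p(1-p)}$, so that (13) reduces to
$Q = 2\,(U(q)/q)\sqrt{(1-p)/(pN)}$; the names `U_OVER_Q` and
`expected_modularity` are hypothetical.

```python
import math

# U(q)/q as listed in Table 1.
U_OVER_Q = {2: 0.384, 3: 0.464, 4: 0.484, 5: 0.485, 6: 0.479,
            7: 0.471, 8: 0.461, 9: 0.452, 10: 0.442}


def expected_modularity(n_nodes, p, q=None):
    """Expected modularity of an ER graph from eq. (13).

    Uses M ~ p N^2 / 2 and J = sqrt(p (1 - p)). With q=None the largest
    tabulated U(q)/q is used, giving the prefactor 2 * 0.485 = 0.97.
    """
    u_over_q = max(U_OVER_Q.values()) if q is None else U_OVER_Q[q]
    return 2.0 * u_over_q * math.sqrt((1.0 - p) / (p * n_nodes))
```

For example, `expected_modularity(128, 16 / 127)` evaluates to roughly 0.226,
and `expected_modularity(10000, 0.01)` to just under 0.1.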



Figure 2 shows the comparison of equation (14) and experiments where we
have numerically maximized the modularity in random graphs with $N = 10,000$
nodes and varying connectivity $\langle k \rangle$ using a simulated annealing
approach as described in an earlier section.

Fig. 2. Modularity and the number of communities in ER graphs. Shown are the
values determined from clustering random graphs with $N = 10,000$ nodes and the
expectation values calculated from using a Potts model (14) or an Ising model
(17) recursively.

The above approximation using a Potts spin glass, however, cannot explain
the number of communities found experimentally in random graphs of varying
connectivity, since it always assumes 5 communities. Therefore, we try
to approximate the ground state of a q-state Potts model by recursively
bi-partitioning the network and continuing as long as the modularity increases.
For every bi-partition we use the expression of the cut-size as a function of
the number of nodes $N$ and average degree $\langle k \rangle = pN$ given by Fu
and Anderson [18]:



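$$\min \big[ C(\{N/2\}) \big] = \frac{N^2 p}{4} \left( 1 - c \sqrt{\frac{1-p}{pN}} \right). \qquad (15)$$

The constant $c$ corresponds to $U(2)$ and is given by $c = 1.5266 \pm 0.0002$
[18]. After every partition, the number of links connecting to nodes in the same
part and to nodes in the rest of the network is given by:

$$\langle k_{in} \rangle = \frac{pN + c\sqrt{pN(1-p)}}{2} \quad \text{and} \quad \langle k_{out} \rangle = \frac{pN - c\sqrt{pN(1-p)}}{2}. \qquad (16)$$

After $b$ successive recursive partitions, we arrive at a modularity of

$$Q = \frac{2^b - 1}{2^b} - \frac{1}{\langle k \rangle} \sum_{t=1}^{b} \langle k_{out,t} \rangle, \qquad (17)$$

where $\langle k \rangle$ is the average degree in the total network and
$\langle k_{out,t} \rangle$ is the average number of external links a node gains
after partition number $t$ calculated from (16).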
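
Equations (16) and (17) translate directly into code. The sketch below
evaluates a single bisection and the modularity for a given list of
$\langle k_{out,t} \rangle$ values. How $N$ and $p$ are updated from one
recursion step to the next is not spelled out here, so the list is taken as an
input rather than generated; the names `C_FA`, `bipartition_degrees` and
`modularity_after_bisections` are hypothetical.

```python
import math

C_FA = 1.5266  # constant c as quoted from Fu and Anderson [18]


def bipartition_degrees(n_nodes, p, c=C_FA):
    """Eq. (16): average internal and external degree after one bisection."""
    mean_degree = p * n_nodes
    shift = c * math.sqrt(mean_degree * (1.0 - p))
    return (mean_degree + shift) / 2.0, (mean_degree - shift) / 2.0


def modularity_after_bisections(mean_degree, k_out_per_step):
    """Eq. (17) for b = len(k_out_per_step) successive bisections."""
    b = len(k_out_per_step)
    return (2 ** b - 1) / 2 ** b - sum(k_out_per_step) / mean_degree
```

By construction the two values returned by `bipartition_degrees` add up to
$pN$, and a single bisection gives
$Q = 1/2 - \langle k_{out,1} \rangle / \langle k \rangle$.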

Though equation (17) only allows numbers of communities that are powers of
2, the agreement with the experimental data is surprisingly good, as Figure 2
shows. Also, the number of communities is predicted almost perfectly by (17),
as shown in Figure 2.

With the expressions (14) and (17), we are able to calculate expectation
values of $Q$ for random graphs which can be used in the assessment of
the statistical significance of the modularity in real world networks. We have
shown that random graphs may exhibit considerable values of modularity
even without any built-in group structure. Significant community structure
can hence only be attributed to graphs with values of modularity higher than
those calculated for random graphs. The sparser a graph, the higher the
modularity of its randomized equivalent. It is hence especially difficult to
detect true modularity in sparse graphs. Also, the sparser a graph, the more
modules it will show, while dense random graphs tend to cluster into only a
handful of communities.



5 Theoretical Limits of Community Detection 

With the results of the last section we are now in the position to explain
Figure 1 and to give a limit to which extent a designed community structure
in a network can be recovered. As we have seen, for any random network we
can find an assignment of spins in communities that leads to a modularity
$Q > 0$. For our computer-generated test networks with $\langle k \rangle = 16$
we have a value of $p = \langle k \rangle/(N - 1) = 0.126$ and expect a value of
$Q = 0.227$ according to (14) and $Q = 0.262$ according to (17). The modularity
of the community structure built in by design is given by:

$$Q(\langle k_{in} \rangle) = \frac{\langle k_{in} \rangle}{\langle k \rangle} - \frac{1}{4}. \qquad (18)$$

Hence, below $\langle k_{in} \rangle = 8$, we have a designed modularity that is
smaller than what can be expected from a random network of the same
connectivity! This
means that the minimum in the energy landscape corresponding to the com- 
munity structure that we design is less deep than those that one can find in 
the energy landscape defined by any network. It must be understood that in 
the search for the built in community structure, we are competing with those 
community structures that arise from the fact that we are optimizing for a 
particular quantity in a very large search space. In other words, any network 
possesses a community structure that exhibits a modularity at least as large 
as that of a completely random network. If a community structure is to be 
recovered reliably, it must be sufficiently pronounced in order to win the
comparison with the structures arising in random networks. In the case of the
test networks employed here, there must be more than $\approx 8$ intra-community
links per node. Figure 3 again exemplifies this. We see that random networks
with $\langle k \rangle = 16$ are expected to show a ratio of internal and
external links $k_{in}/k_{out} \approx 1$. Networks which are considerably
sparser have a higher ratio, while denser networks have a much smaller ratio.
This means that in dense networks, we can recover designed community structure
down to relatively smaller $\langle k_{in} \rangle$. Consider for example large
test networks with $\langle k \rangle = 100$ and 4 built-in communities. For
such networks, we expect a modularity of $Q \approx 0.1$ and hence the critical
value of intra-community links down to which the community structure could
reliably be estimated would be $\langle k_{in} \rangle_c = 35$, which is much
smaller in relative comparison to the average degree in the network.

Fig. 3. Ratio of internal links to external links $k_{in}/k_{out}$ in the ground
state of the Hamiltonian, plotted against $pN = \langle k \rangle$. Shown are the
experimental values from clustering random graphs with $N = 10,000$ nodes
(ER-Graphs) and the expectation values calculated from using a Potts-Model (14)
(Potts Prediction) or an Ising-Model (17) recursively (Ising Prediction). The
dotted line represents the Radicchi et al. definition of community in "strong
sense" [11]. Note that sparse graphs will, on average, always exhibit such
communities, while dense graphs will not, even though their modularity may be
well above the expectation value for an equivalent random graph.

This also means that the point at which we cannot distinguish between a
random and a modular network is not defined by $p_{in} = p_{out} = p$ for the
internal and external link densities as one may have intuitively expected.
Rather, it is determined by the ratio
$\langle k_{in} \rangle / (\langle k \rangle - \langle k_{in} \rangle)$ in the
ground state of a random network and depends on the connectivity of the
network $\langle k \rangle$.
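
The critical values quoted in this section follow from equating the designed
modularity (18) with the modularity expected for a random graph. The sketch
below assumes equal-sized planted groups, so that the constant 1/4 in (18)
generalizes to one over the number of groups; the function names
`designed_modularity` and `critical_k_in` are hypothetical.

```python
def designed_modularity(k_in, mean_degree, n_groups=4):
    """Eq. (18), written for n_groups equal-sized planted groups (assumption)."""
    return k_in / mean_degree - 1.0 / n_groups


def critical_k_in(random_modularity, mean_degree, n_groups=4):
    """Value of <k_in> at which the designed modularity equals random_modularity."""
    return mean_degree * (random_modularity + 1.0 / n_groups)
```

With $\langle k \rangle = 100$, 4 groups and a random-graph modularity of 0.1
this gives the critical value 35 quoted above; with $\langle k \rangle = 16$
and $\langle k_{in} \rangle = 8$ the designed modularity is 0.25.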

Finally, from Figure 3 we observe that sparse random graphs all show commu- 
nities in the strong sense of Radicchi et al. [11]. Further, it is very difficult to 
find communities in the strong sense in dense graphs, even though they may 
exhibit a modularity well above that of a random graph. 

6 Conclusion

Starting from a simple principle, we have shown how the problem of community
detection can be mapped onto finding the ground state of an infinite range
spin glass. The quality function of the clustering is identified as the ground
state energy of this spin glass. Benchmarks show the good performance of
algorithms based on this mapping. The network modularity $Q$ defined by Girvan
and Newman is identified as a special case of this approach. The comparison
with appropriate random graphs allows the assessment of the statistical
significance of community structures found in real world networks. Expectation
values for the modularity of Erdős-Rényi random graphs were given. The
theoretical limits of community detection were addressed and we found that only
those community structures can be recovered reliably that lead to modularities
larger than the expectation values of random graphs.